Fix Transformer2DModel ada_norm #7578

Conversation
    else:
        conditioning = self.transformer_blocks[0].norm1.emb(
            timestep, class_labels, hidden_dtype=hidden_states.dtype
        )
        shift, scale = self.proj_out_1(F.silu(conditioning)).chunk(2, dim=1)
        hidden_states = self.norm_out(hidden_states) * (1 + scale[:, None]) + shift[:, None]
        hidden_states = self.proj_out_2(hidden_states)
Isn't this practically the same as what we're doing when norm_type == "ada_norm"?
It is similar, but ada_norm doesn't take class labels as an argument. I moved this into the else branch because the original condition was if norm_type != "ada_norm_single". From what I can tell, that block still only supports the norm used in the original DiT implementation. It might be worth refactoring it to allow other norm types.
We can then still condition on whether class_labels is not None, or something like that, no?
If class_labels is None then you can't use AdaLayerNormZero. I'm not sure what the default norm should be when you want to condition on text or audio, but I picked AdaLayerNorm because it was similar to the zero variant without needing class labels.
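For illustration, here is a minimal sketch of the difference in call signatures, assuming the diffusers.models.normalization layout around the time of this PR and the chunk(dim=1) fix proposed below; the sizes and num_embeds value are made up:

    # Sketch only (not from the PR): contrasting AdaLayerNorm and AdaLayerNormZero.
    import torch
    from diffusers.models.normalization import AdaLayerNorm, AdaLayerNormZero

    dim, num_embeds = 64, 1000                        # illustrative sizes
    x = torch.randn(2, 16, dim)                       # (batch, seq_len, dim)
    timestep = torch.randint(0, num_embeds, (2,))
    class_labels = torch.randint(0, num_embeds, (2,))

    # AdaLayerNorm conditions on the timestep only.
    out = AdaLayerNorm(dim, num_embeds)(x, timestep)

    # AdaLayerNormZero additionally requires class labels, so it can't be used
    # when class_labels is None.
    out_zero, gate_msa, shift_mlp, scale_mlp, gate_mlp = AdaLayerNormZero(dim, num_embeds)(
        x, timestep, class_labels, hidden_dtype=x.dtype
    )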
How about ada_norm_single, i.e., the one used in PixArt Alpha?
If you use that one without additional arguments you get this error:
TypeError: PixArtAlphaCombinedTimestepSizeEmbeddings(
(time_proj): Timesteps()
(timestep_embedder): TimestepEmbedding(
(linear_1): Linear(in_features=256, out_features=1408, bias=True)
(act): SiLU()
(linear_2): Linear(in_features=1408, out_features=1408, bias=True)
)
) argument after ** must be a mapping, not NoneType
It requires the arguments resolution and aspect_ratio, but those could default to None because they aren't required when use_additional_conditions=False.
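For context, a small sketch of the workaround being described, assuming AdaLayerNormSingle from diffusers.models.normalization; the embedding dim here is illustrative:

    # Sketch only: calling the PixArt-style norm without additional conditions.
    import torch
    from diffusers.models.normalization import AdaLayerNormSingle

    norm_single = AdaLayerNormSingle(1152, use_additional_conditions=False)  # illustrative dim
    timestep = torch.tensor([999, 999])

    # Passing added_cond_kwargs=None raises
    # "argument after ** must be a mapping, not NoneType", because the kwargs are
    # unpacked into the combined timestep/size embedding. Passing explicit None
    # values works when use_additional_conditions=False:
    conditioning, embedded_timestep = norm_single(
        timestep,
        added_cond_kwargs={"resolution": None, "aspect_ratio": None},
        batch_size=2,
        hidden_dtype=torch.float32,
    )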
Do you think changing that block would be more appropriate?
@yiyixuxu WDYT?
    scale, shift = torch.chunk(emb, 2, dim=1)
    x = self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]
Can you please explain why this set of changes?
The default for dim in torch.chunk is dim=0, so it was splitting on the batch dimension. The second line change broadcasts the scale and shift to the correct shape, which is exactly what is done elsewhere in the implementation where scale and shift are used. I could change it to scale[:, None] and shift[:, None] to match the other places.
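To make the dim behavior concrete, here is a quick standalone example (not from the PR, with made-up shapes):

    import torch

    emb = torch.randn(4, 128)                  # (batch, 2 * embedding_dim)

    a, b = torch.chunk(emb, 2)                 # default dim=0: splits the batch -> (2, 128) each
    scale, shift = torch.chunk(emb, 2, dim=1)  # dim=1: splits the features -> (4, 64) each

    x = torch.randn(4, 16, 64)                 # (batch, seq_len, embedding_dim)
    # scale[:, None, :] / shift[:, None, :] broadcast over the sequence dimension:
    out = x * (1 + scale[:, None, :]) + shift[:, None, :]
    print(a.shape, scale.shape, out.shape)     # (2, 128), (4, 64), (4, 16, 64)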
Actually that caused some of the tests to fail. I will take a look at that.
This is the test that fails. Is it intended for AdaLayerNorm to not support batch size? If you add a batch dimension to this test it passes with my changes.
diffusers/tests/models/test_layers_utils.py, line 381 (commit 1c60e09):

    timestep_1 = torch.tensor(1, dtype=torch.long).to(torch_device)
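For reference, a batched variant of that test input could look like this sketch (assuming the chunk(dim=1) change from this PR; the sizes are made up):

    import torch
    from diffusers.models.normalization import AdaLayerNorm

    embedding_dim = 32                                 # illustrative sizes
    adaln = AdaLayerNorm(embedding_dim, 4)
    x = torch.randn(2, 16, embedding_dim)              # add a batch dimension of 2
    timestep = torch.tensor([1, 1], dtype=torch.long)  # one timestep index per sample
    assert adaln(x, timestep).shape == x.shape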
Probably because of how the integration was done. During that process, the number one priority is to get the model integrated into the library, so exhaustiveness isn't prioritized at all.
Thanks for the PR. Left some comments.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Something I want to add is that I'm not partial to

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Gently pinging @yiyixuxu

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Was this ever fixed?

I'm not sure if this specifically was fixed, but there are several DiT models to choose from now.

Got it, thanks. Will use the DiTTransformer2DModel or something else.

That's what I ended up doing.
Doesn't

    # Validate inputs.
    if norm_type != "ada_norm_zero":
        raise NotImplementedError(
            f"Forward pass is not implemented when `patch_size` is not None and `norm_type` is '{norm_type}'."
        )
    elif norm_type == "ada_norm_zero" and num_embeds_ada_norm is None:
        raise ValueError(
            f"When using a `patch_size` and this `norm_type` ({norm_type}), `num_embeds_ada_norm` cannot be None."
        )
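For illustration, a relaxed version of that check along the lines of the refactor discussed above might look like the sketch below; this is hypothetical, not part of the PR, and validate_patch_norm is a made-up helper name:

    def validate_patch_norm(norm_type: str, num_embeds_ada_norm) -> None:
        # Hypothetical: also accept "ada_norm" when patch_size is set (sketch only).
        supported = ("ada_norm_zero", "ada_norm")
        if norm_type not in supported:
            raise NotImplementedError(
                f"Forward pass is not implemented when `patch_size` is not None and `norm_type` is '{norm_type}'."
            )
        if num_embeds_ada_norm is None:
            raise ValueError(
                f"When using a `patch_size` and this `norm_type` ({norm_type}), `num_embeds_ada_norm` cannot be None."
            )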
What does this PR do?

This is a fix for an issue I opened earlier today. I'll link it down below.

Fixes #7575
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.